ABSTRACT


The openness of the World Wide Web (Web) has left it increasingly exposed
to cyber-attacks. Attackers often carry out these attacks through malicious
Uniform Resource Locators (URLs), since URLs are widely used by internet
users. Therefore, an effective approach is required to detect malicious URLs
and identify the nature of their attacks. This study aims to assess the
efficiency of the machine learning approach in detecting and identifying
malicious URLs. We apply feature optimization with a bio-inspired algorithm
to select the significant URL features that are able to detect malicious
URLs. Machine learning is combined with a static analysis technique to
detect malicious URLs. Based on this combination and the significant
features, this paper shows promising results with higher detection accuracy.
The bio-inspired algorithm, particle swarm optimization (PSO), is used to
optimize the URL features. In detecting malicious URLs, the results show
that naive Bayes and support vector machine (SVM) achieve a high detection
accuracy of 99% using URL features.


1. INTRODUCTION

As web services become increasingly prevalent, businesses and people are moving toward web
applications. At present, most people depend heavily on web applications for routine activities such as
communication, internet banking, online shopping, information gathering, forum discussion and socializing.
This growth in web applications exposes them to various threats that exploit their vulnerabilities [1, 2].
Attackers use web application vulnerabilities as a stepping stone to compromise URLs for malicious
purposes [3, 4]. For instance, attackers insert redirect code into a compromised URL so that the user is
navigated automatically to a malicious URL [5-7]. These malicious URLs can also lead the user to download
a malicious application, such as botnet malware, onto a computer, enabling the attacker to collect
confidential information such as bank account numbers and contact information [8, 9].

Malicious URLs continue to grow, and there are 230,000 new malware samples per day [5].
According to Cybint News, attackers launch an attack every 39 seconds and have infected 64% of
companies [10]. Kaspersky Lab solutions have blocked more than 7 million attacks and recognized
282,807,433 unique URLs as malicious [11]. These malicious attempts aim to collect confidential
information such as account numbers and passwords in order to steal money from computer users. In addition,
in February 2018 attackers launched a major attack in which 100 WordPress and Joomla sites were
infected by malware known as ionCube [12]. These figures show that attackers use almost any vulnerability
within URL applications to perform attacks, including exploit-based attacks [13]. To address this problem,
users try to shield their computers by updating website applications and anti-malware software, and in
the midst of doing so they have to pay constant attention to the URLs they access.

Recent studies have shown that a number of detection approaches are available to combat the
increasing number of malicious URLs. For instance, signature-based and behavior-based techniques
[14] are used to detect malicious URLs [15, 16]. In particular, signature analysis detects
malicious URLs by analyzing their signatures. These signatures are stored in a database repository, which
represents all the knowledge the signature-based approach has about malicious URLs.
Behavioral analysis, also known as heuristic analysis, detects malicious URLs by
investigating the program in an isolated environment. Heuristic analysis also applies
machine learning and data mining to learn the behavior of executable malicious URLs. In addition,
identifying the most appropriate set of features (URL, host, content, graph and blacklist) helps in
efficiently classifying web pages and URLs as malicious or benign.

Although many security defenses have been developed against malicious URLs, security
still has a long way to go. This paper proposes a malicious URL detection system that
identifies new variants of known malware and examines the presence of dangerous URLs on
websites. The proposed study applies a heuristic-based approach and collects URL features from websites.
The focus of this paper is therefore to detect malicious websites based on their URLs. The main
contributions of this paper are the following:

a) The evaluation study applies URL features to malicious and benign samples from a Kaggle dataset.

b) The proposed PSO optimizes the URL features using tenfold cross-validation.

c) The proposed naive Bayes and SVM classifiers increase the accuracy of classifying the optimized URL
features of malicious and benign website applications.

The rest of the paper is organized as follows. Section 2 discusses related work.
Section 3 describes the methodology, which includes feature optimization and the general architecture.
Section 4 evaluates the effectiveness of the malicious URL detection system. Lastly, Section 5 concludes
the paper.


2. RELATED WORK 

Malicious URLs exploit vulnerabilities and pose a significant threat to computers.
This threat from malicious websites has become an important and growing issue [17, 18, 19]. Many studies
have proposed methods for analyzing and detecting malicious URLs. Three types of approaches are used to
detect malicious URLs: static analysis, dynamic analysis, and heuristic analysis.

Static analysis determines whether a URL is malicious or benign based on the extracted source
code. In most cases, a URL that contains suspicious code is labeled as a malicious website. The authors
of [20] examined malicious websites based on their HTML code, analyzing the characteristics of URLs
to decide whether they are malicious. Their results show that the approach is resilient to code obfuscation
and correctly determines whether a URL is malicious. The authors of [21] focused on drive-by download
attacks, detecting malicious URLs using traffic in a real network. They propose a two-stage drive-by
download detection mechanism that examines malicious URLs based on domain reputation and applies
sandboxing to monitor the network based on URLs and reduce the detection time. In their experiment,
they achieved 94% accuracy and reduced the detection time by more than 12 times compared with real
computing traffic.

Many studies are concerned with the risk and impact on users' computers when surfing
websites. The authors of [18] proposed a risk assessment that monitors the risk of a URL by using the
destination information when a short URL is generated. Through this monitoring, any risky URL, or any
URL whose risk exceeds the threshold, is blocked to prevent malicious attacks [19], especially drive-by
downloads through short URLs. The studies in [20, 22] applied machine learning to detect malicious
websites. The authors of [22] implemented three supervised machine learning techniques: Support Vector
Machine (SVM), K-Nearest Neighbor (KNN) and Naive Bayes (NB). Besides supervised learning, they also
applied unsupervised techniques, namely Affinity Propagation and K-Means, to detect malicious websites.
In their experiment, the proposed approach achieved 98% accuracy with supervised machine learning and
96% accuracy with unsupervised machine learning.


3. RESEARCH METHOD

This section describes the architecture of the experiment, which is important for executing the
experiments. To detect malicious URLs, this paper applies optimization and machine
learning approaches to optimize the features and train on the samples. Optimizing and training on the
samples are important in order to learn the behavior of malware and benign applications.

Figure 1 presents the main components of the malicious URL detection system. The detection
architecture has three phases: data collection, machine learning, and database. Data collection
begins with crawling all the URLs, including malware and benign website applications. The data are then
processed through a feature selection engine to collect relevant features for training purposes.

Figure 1. Malicious URLs detection architecture


3.1. Features Extraction and Selection 

A relevant set of features is essential for achieving high machine learning performance [21, 23].
The feature criteria are important because they represent the essential characteristics of a malicious
website [24]. The process includes removing noise and irrelevant features from the dataset. Table 1 shows
the list of URL features used in the experiment.

These features are important for the construction of the classification model, the malicious URL
detection process and the identification of attack types. Here, malicious and benign URL features are
encoded as a binary number (0 or 1): the value is 1 if the feature exists in the URL and 0 if it does
not. These features are then used for training, testing and feature optimization in WEKA.


Table 1. List of URLs Features


Features            Description
Token Count         The total number of words in the URLs
Rank Host           The popularity ranking of the hostnames
Rank Country        The popularity ranking of the URLs (websites) among countries
ASNno               Autonomous System Number as the classifier for the IP of each URLs
Sec_sen_word_cnt    The security sensitive word count from the URLs
Avg_token_lenght    The average token length in the URLs
No_of_dots          The number of dots in the URLs
Length_of_url       The length of the URLs
Avg_path_token      The average number of the path for URLs
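
As an illustration of how the purely lexical features in Table 1 could be derived from a URL string, the
following minimal Python sketch may help. It is not the extraction code used in this study. The
delimiter set, the word list in SENSITIVE_WORDS and the reading of Avg_path_token as the average length
of the path tokens are assumptions made here for illustration only. Rank Host, Rank Country and ASNno
are omitted because they require external lookups.

```python
import re
from urllib.parse import urlparse

# Hypothetical word list: the paper does not publish its security-sensitive words.
SENSITIVE_WORDS = ("login", "signin", "secure", "account", "bank", "confirm", "update")


def extract_features(url):
    """Return a dict of simple lexical features for one URL string."""
    # Add a scheme when it is missing so that urlparse fills in the path correctly.
    parsed = urlparse(url if "://" in url else "http://" + url)
    # Assumed delimiter set: split on common URL separators and drop empty pieces.
    tokens = [t for t in re.split(r"[/.?=&\-_:]+", url) if t]
    path_tokens = [t for t in re.split(r"[/.\-_]+", parsed.path) if t]
    lowered = url.lower()
    return {
        "Length_of_url": len(url),
        "No_of_dots": url.count("."),
        "Token Count": len(tokens),
        "Avg_token_lenght": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
        # Assumption: interpreted here as the average length of the path tokens.
        "Avg_path_token": (sum(map(len, path_tokens)) / len(path_tokens)
                           if path_tokens else 0.0),
        "Sec_sen_word_cnt": sum(lowered.count(w) for w in SENSITIVE_WORDS),
    }


if __name__ == "__main__":
    for name, value in extract_features("http://example.com/login/update.php?id=1").items():
        print(name, value)
```

The resulting dictionary can be written out as one row of a feature table and loaded into a tool such
as WEKA for training and testing.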


3.2. Machine Learning 

This section applies a machine learning approach to select the relevant features [25] used
for detecting malicious URLs. To select the relevant features, an optimization approach is
implemented to optimize the URL features. This optimization can reduce the time for training and
testing and simplify the malicious URL detection system [20, 26]. It is also important for data
processing [22]. Without good knowledge of classification, it is difficult for malware analysts to identify
relevant features for malicious URLs. Feature optimization is therefore a suitable approach for
increasing the effectiveness of malicious URL detection and improving accuracy. Hence, this study applies
particle swarm optimization (PSO) for feature optimization based on tenfold cross-validation.
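
To make the idea of PSO-based feature optimization concrete, the sketch below shows one common variant,
binary PSO, searching for a 0/1 mask over the candidate features. It is a simplified illustration and
not the WEKA implementation used in this study. The parameter values (w, c1, c2, swarm size) and the
function toy_fitness are assumptions. In practice the fitness would be the tenfold cross-validated
accuracy of a classifier trained on the selected features.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a 0/1 feature mask that maximizes fitness(mask)."""
    rng = random.Random(seed)
    # Each particle holds a 0/1 mask (position) and a real-valued velocity per feature.
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Inertia plus attraction to the personal best and the global best.
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))  # clamp to keep the sigmoid stable
                # The sigmoid turns the velocity into the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Hypothetical stand-in fitness: rewards three "relevant" features, penalizes extras.
RELEVANT = {0, 2, 5}


def toy_fitness(mask):
    hits = sum(1 for i in RELEVANT if mask[i])
    extras = sum(mask) - hits
    return hits - 0.5 * extras


if __name__ == "__main__":
    mask, score = binary_pso(toy_fitness, n_features=9)
    print("selected mask:", mask, "fitness:", score)
```

The search returns the best mask found and its fitness; features whose bit is 1 are kept for training
the classifiers.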

The performance of the optimized features is then compared across different classifiers in order to
evaluate their effectiveness in the malicious URL detection system. Five machine learning classifiers,
namely AdaBoost, Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Naive Bayes and Random Forest,
are used to build the machine learning models in WEKA.
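
Both the feature optimization and the classifier comparison rely on tenfold cross-validation. The
following sketch shows how the sample indices can be divided into ten folds, with each fold used once
for testing and the remaining nine for training. It is a plain, non-stratified illustration; the
function name k_fold_indices and the seed are assumptions, and WEKA's own procedure may differ (for
example, it can stratify the folds by class).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # shuffle once so folds are not ordered by class
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        yield train, test


if __name__ == "__main__":
    for fold_no, (train, test) in enumerate(k_fold_indices(100), start=1):
        print(f"fold {fold_no}: {len(train)} training samples, {len(test)} test samples")
```

The reported score is then the average of the ten per-fold results, so every sample is used for
testing exactly once.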


4. RESULTS AND ANALYSIS 

To evaluate the performance of the machine learning approach in detecting malicious URLs,
the benign and malicious URL applications were mixed together for training and testing purposes.
For the training and testing models, parameters including the cross-validation setting need to
be set. Table 2 presents the detection performance of the machine learning classifiers.


Table 2. Detection Performance of Machine Learning


Classifier                 Accuracy   TPR     FPR     Precision   Recall   F-measure
Random Forest              97%        0.960   0.020   0.980       0.960    0.970
Naive Bayes (This study)   99%        0.980   0.000   1.000       0.980    0.990
k-NN                       97%        0.980   0.040   0.961       0.980    0.970
SVM (This study)           99%        0.980   0.000   1.000       0.980    0.990
AdaBoost                   97%        0.960   0.020   0.980       0.980    0.970


Table 2 shows the detection performance of five classifiers for malicious URL detection.
The performance of each classifier is evaluated by six performance metrics: accuracy, true positive
rate (TPR), false positive rate (FPR), precision, recall, and F-measure. Table 2 indicates that Naive
Bayes, k-NN and SVM recorded the highest TPR of 0.98, compared with the other two algorithms, Random
Forest and AdaBoost, which both recorded 0.96. This means that these three algorithms have high
sensitivity towards malicious data. Furthermore, neither Naive Bayes nor SVM produced any false positives
on the dataset, since both recorded an FPR of zero. Naive Bayes and SVM also recorded the highest
precision (1.000), which means that every URL they flagged as malicious was indeed malicious. Hence,
naive Bayes and SVM perform better, with 99% accuracy, than the other classifiers, which reached 97%.
It is worth noting that machine learning with feature optimization plays an important role in identifying
the relevant features for detecting malicious URLs.
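
For reference, the six metrics can be computed from the counts of a binary confusion matrix as in the
sketch below. The counts in the usage example are hypothetical: the paper does not report its confusion
matrices or test-set sizes, and the numbers were chosen only so that the resulting ratios line up with
the Naive Bayes and SVM rows of Table 2.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute the Table 2 metrics from confusion-matrix counts.

    tp: malicious URLs flagged as malicious, fn: malicious URLs missed,
    fp: benign URLs flagged as malicious,   tn: benign URLs passed as benign.
    All denominators are assumed to be non-zero.
    """
    tpr = tp / (tp + fn)            # true positive rate, identical to recall
    fpr = fp / (fp + tn)            # false positive rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {
        "accuracy": accuracy,
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "recall": tpr,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 malicious and 100 benign test URLs.
    for name, value in detection_metrics(tp=98, fp=0, tn=100, fn=2).items():
        print(f"{name}: {value:.3f}")
```

With these counts, the F-measure is 2 x 1.000 x 0.980 / (1.000 + 0.980), which rounds to 0.990.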


5. CONCLUSION 

This paper has presented the performance of the proposed approach in detecting malicious URLs.
The optimization step selected the relevant URL features, and the machine learning classifiers
correctly classified the malicious samples based on them. The experiments used a dataset of real
malicious and benign URL samples. The experimental results show that the proposed approach achieves
high accuracy in classifying the malicious URL samples.